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Abstract. Simultaneous matrix diagonalization is used as a subroutine 
in many machine learning problems, including blind source separation 
and paramater estimation in latent variable models. Here, we extend al¬ 
gorithms for performing joint diagonalization to low-rank and asymmet¬ 
ric matrices, and we also provide extensions to the perturbation analysis 
of these methods. Our results allow joint diagonalization to be applied 
in several new settings. 


1 Introduction 

Consider a set ot L ^ 2 matrices At = of the form 

Ml = UAiU^, ( 1 ) 

where U G ^d.xk factors common to all the matrices, and the diagonal Ai G 
IRfcxfe weights that are specific to each matrix Mi. 

Simultaneous matrix diagonalization consists in determining the unknown 
factors and weights from the matrices Mi. Unlike in traditional single-matrix di- 
agonalization, the U may be non-orthogonal (such factors are identifiable when 
A A 2; see Afsari [2]), and when the U are orthogonal, simultaneously diagonal¬ 
izing the entire set Ad if often more robust to noise that diagonalizing a single 
Ml. 

Joint diagonalization arises in several machine learning settings, including 
blind-source separation [12] and latent variable estimation via teirsor factor¬ 
ization [7]. However, our understanding of algorithms for jointly diagonalizing 
matrices is far from complete: even the low-rank (fc < d) and the asymmetric 
settings have not been considered in the literature. 

Here, we show how to extend existing algorithms — notably the Jacobi [5] 
and QRJID [1] methods — to these two settings. Our extensions enable one 
to apply simultaneous diagonalization to several new problems. For example, it 
was recently shown that tensor factorization can be reduced joint matrix diago- 
nalization; our algorithms make this reduction also applicable to low-rank and 
asymmetric tensors. This in turn leads to accuracy improvements on problems 
in community detection and for topic modeling. 


Finally, we also extend existing perturbation analyses for noisy matrices of 
the form 


Ml = UAiU^ + eRi ( 2 ) 

for some e > 0 and some matrix Ri having unit norm. First, we give a simple 
generalization of existing bounds to the low-rank and asymmetric settings. These 
bounds can be used to derive formal guarantees for methods that use simulta¬ 
neous diagonalization as a subroutine; we again provide tensor factorization as 
an example. However, current bounds only hold for the true joint diagonalizer of 
the set Af, and there are no guarantees on whether it is attained by joint diago¬ 
nalization algorithms; this limits the usefulness of theoretical analyses based on 
these lemmas. In the last section of the paper, we address this shortcoming by 
showing that for sufficiently small noise, the global minimizer will be attained, 
as long as we initialize the diagonalization subroutine with factors obtained by 
diagonalizing a single matrix. 


2 Background 

2.1 Notation 

We establish notation for the classical symmetric simultaneous diagonalization 
case; we will extend it to the asymmetric case in a subsequent section. 

We are given as input a set of matrices Mi ,..., Ml e ^ where each Mi 
can be expressed as 


Ml = UAiU^ + eRi. (3) 

The diagonal weight matrix Ai e and the noise Ri (satisfying ||i?i|| ^ 1) 
are individual to each Mi, but the non-singular transform U e is common 
to all the matrices. 

Our goal is to find an invertible transform V~^ such that each V~^MiV~~^ is 
nearly diagonal. This problem admits a unique solution when there are at least 
two matrices [2]. There are a number of objective functions that are designed for 
determining joint diagonalizers [5,9,1], but in this paper, we focus on a popular 
one that penalizes off-diagonal terms: 

Fiy) = Y, oK{V-^MiV-^) = Y Yiy~"MiV-^%. (4) 

1=1 1=1i^j 

An important setting of this problem, which we refer to as the orthogonal case, 
is when U is orthogonal, in which case we will also constrain the optimization 
to be orthogonal = V~^. 


2.2 Previous work 


There exist several algorithms for optimizing F{V). In this paper, we will use 
the Jacobi method [3,5] for the orthogonal case and the QRJID algorithm [1] for 
the non-orthogonal case. Both techniques are based on same idea of iteratively 
constructing V~^ via a product of simple matrices V~^ = Bt ■ ■ ■ B 2 B 1 , where 
at each iteration t = 1,... ,T, we choose Bt to minimize JiV). Typically, this 
can be done in closed form. 

The Jacobi algorithm for the orthogonal case is a simple adaptation of the 
Jacobi method for diagonalizing a single matrix. Each Bt is chosen to be a Givens 
rotation [3] defined by two of the d axes i < j e [d]: Bt = cos0(Aii + Ajj) + 
sm0(Aij — Aji) for some angle 0, with Ay being a matrix which is 1 in the 
(i,j)-th entry and 0 elsewhere. We sweep over all i < j, compute the best angle 
0 in closed form using the formula proposed by [5] to obtain Bt , and then update 
each Ml by BjMiBt. The above can be done in 0{d?L) time per sweep. 


Data: symmetric matrices 
Let U = r, 

while objective is decreasing do 
for i = 1, 2,..., d do 

for j = i + l,i + “2,d do 

Let 9 be the minimizer of F{Gij W); 

u^uGijiey, 

for I = 1, 2,..., L do 

I Ml ^Gij{efM,Gij{ey, 

end 

end 

end 

end 

Algorithm 1: The Jacobi algorithm for simultaneous diagonalization 


For the non-orthogonal case, the QRJID algorithm is similar, except that Bt 
is chosen to be either a lower or upper unit triangular matrix {Bt = I + aAy 
for some a and i ¥= j). The optimal value of a that minimizes J{V) can also be 
computed in closed form (see [1] for details). The running time per iteration is 
the same as before. 

Most popular algorithms both in the orthogonal [3] and the non-orthogonal 
setting [12,10] are guaranteed to converge locally. The formal question of global 
convergence is currently open even for methods that are widely used [8,5]. In 
practice, however, the Jacobi-style methods we adopt are well-known to behave 
as if they global convergence [3,5,8]. 

In the asymmetric setting, variants of the Jacobi algorithm have been pro¬ 
posed for orthogonal factors [6] . These variants essentially diagonalize the set of 
matrices Mj = Mj'Mj. Notice that when Mj = UAV'^, we have Mj = VA'^V'^^ 
and so the assymetric problem can be reduced to one with symmetric Mj. In the 





Data: symmetric matrices 
Let [/ = 7; 

while objective is decreasing do 
for i = 1,2, ...,d do 

for j = i + l,i + 2,..., d do 

Let 9 be the minimizer of F{Gij W); 

u^uGijiey, 

for I = 1,2,..., L do 

I Ml ^Gij{efMiGij{ey, 

end 

end 

end 

for i = 1, 2,d do 

for j = i + l,i + 2,..., d do 

Let d be the minimizer of F[Aij (a)); 

U ^UAij {ay 
for I = 1,2,..., L do 
I Ml Aij{d)'^MiAij{a)-, 

end 
end 
end 
end 

Algorithm 2: The QRJID algorithm for simultaneous non-orthogonal diago- 
nalization 







low-rank setting, an extension of the Jacobi algorithm has been proposed for a 
single matrix [11]. This extension implictely sorts the entries of the matrix and 
performs only 0{dk^) Givens rotations per sweep. However, the L ^ 2 setting 
has not been considered in the literature. Finally, for non-orthogonal matrices, 
neither the low-rank and asymmetric case has been worked out to the best of 
our knowledge. 


3 Low-rank matrices 

In this section, we provide extensions of simultaneous algorithms to low-rank 
matrices. We start with orthogonal factors; in this setting, we take inspiration 
from the sorted Jacobi method for a single low-rank matrix and propose a sorted 
Jacobi method for multiple matrices. Then, we show how to generalize the same 
sorting idea to the QRJID algorithm, which leads to a low-rank algorithm for 
the non-orthogonal setting as well. 


3.1 Orthogonal setting 

First, suppose that we applied the Jacobi algorithm to a single matrix M whose 
eigenvalues Ai ^ A 2 ^ ... ^ \k (corresponding to the diagonal of the matrix A) 
were sorted. In that case, we would only need Jacobi to zero out the entires of 
M associated with the first k rows or the first k columns. In other words, we 
would have to transform the matrix into the following form: 

/Ai 0 0 0 \ 

0 A 2 0 0 
0 0 X X 
V 0 0 X x/ 

The two left-most columns of the diagonalizing matrix will correspond to the 
eigenvectors associated with Ai and A 2 while the other two columns will contain 
arbitrary numbers. Also, since Givens rotations are essentially independent of 
each other, it takes only kd Givens rotations per sweep to turn the input matrix 
into the above form. 

If the eigenvalues of the input matrix M are not sorted, then we can sort 
the diagonal of M after every sweep of Jacobi (while still performing only kd 
Givens rotations per sweep). When the algorithm terminates, all the k non-zero 
eigenvalues will find themselves in the top k x k corner; if that wasn’t the case, 
then they would have been swapped out with another entry from outside that 
corner. 

When there is more than one matrix, the components j over which the rank 
is positive are ones for which I I > 0- This suggests a natural extension 
of the above idea: choose Givens rotations in a way as to push mass on the sum 
of the absolute values of the matrix diagonals towards the upper left corner. This 
idea is implemented in Algorithm 3 




The sorting opereation corresponds to a permutation of the columns of the 
Mj, which does not change the value of the objective function. Thus the sorted 
Jacobi algorithm also decreases the objective function value at every iteration, 
which ensures its convergence to a low-rank diagonalizer. However, unlike in the 
single-matrix setting, we do not provide formal guarantees that the resulting 
diagonalizer is optimal; even in the full-rank setting convergence properties of 
the simultaneous Jacobi method remains an open problem. However, in the next 
section we demonstrate empiracally that just as in the full-rank setting, our 
low-rank extension appears to converge for any matrix. 


Data: symmetric matrices 


Let U = I-, 

while objective is decreasing do 
for i = 1,2, ...,k do 

for j = i + l,i + 2,..., d do 

Let 6 be the minimizer of F{Gij W); 

u^uGijiey, 

for I = 1, 2,..., L do 

I Ml ^Gij{efMiGij{ey, 

end 


if Siti > Sf^r then 


Flip columns i and j in U and the Mi ; 


end 


end 


end 


end 

Algorithm 3: The low-rank Jacobi algorithm for simultaneous diagonalization 


3.2 Non-orthogonal setting 

The QRJID algorithm has a very similar structure to Jacobi: over the course of 
a sweep, the matrices Mj are first multiplied by Givens rotation, and then by 
lower triangular matrices. In each case, a rotation only affects two columns and 
two rows of Mj. 

We can similarly sort the diagonal entries of the matrices and zero-out only 
the top k X k square of the Mj. This leads to Algorithm 4. 

4 Asymmetric matrices 


Suppose now that we have a set of L ^ 2 matrices Ai = {Mi}f^^ of the form 

Ml = UAiV^, (5) 





Data: symmetric matrices 
Let [/ = 7; 

while objective is decreasing do 
for i = 1,2, ...,k do 

for j = i + l,i + 2,..., d do 

Let 6 be the minimizer of F{Gij{9))', 

U ^ UGijiO)- 
for I = 1,2,..., L do 

I Ml ^Gij{efM,Gij{ey, 

end 

if Siti l(A)jjl > I!f=i then 

I Flip columns i and j in U and the Mi ; 

end 

end 

end 

for i = 1,2,..., k do 

for j = i + l,i + 2,..., d do 

Let d be the minimizer of F{Aij («)); 

U ^UAij {a)- 
for I = 1,2, ...,L do 
I Ml <— Aij{d)'^MiAij{a)-, 
end 

if Siti \ > ZLi \{^i)n \ then 

I Flip columns i and j in U and the Mi ; 

end 

end 

end 

end 

Algorithm 4: The QRJID algorithm for simultaneous non-orthogonal diago- 
nalization 


where U G and V e are sets of common factors, possibly non- 

orthogonal. 

When the U and V are orthogonal, there is a well-known procedure that 
reduces this problem to the symmetric case by instead diagonalizing the matrices 
M[ = Mf Ml = VA'^V'^. Unfortunately, this reduction does not work for non- 
orthogonal matrices; here, we propose another reduction that works in both 
orthogonal and non-orthogonal cases. 

For each Mi, define another matrix 'j observe that 


^ 0 

Ml 


V V " 

^Ai 0 " 

Y V " 

Ml 

0 


u -u 

0 -di 

u -u 


The (Ni) are symmetric matrices with common (in general, non-orthogonal) 
factors. Therefore, they can be jointly diagonalized and from their components, 
we can recover the components of the (Mi). 















Using the above low-rank algorithms, it is possible to determine the C/, V 
factors in Oldid^) time (assuming d 2 ^ di). This is worse than the 0[d\d2) 
time it takes to determine the SVD of a single matrix. It remains to be seen if 
non-orthogonal joint diagonalization admits algorithms are as fast as ones for 
the ordinary SVD. 

5 Experiments 

5.1 Full-rank matrices 



Fig. 1: Histogram of objective values attained by the Jacobi simultaneous di¬ 
agonalization algorithm for 1000 random sets of jointly diagonalizable matrices 
initialized with various amounts of noise. 


To assess the convergence properties of the Jacobi method, we ran the al¬ 
gorithm 1000 times on different sets of L random matrices {L = d = k = 15) 
corrupted with varying amounts of noise (e = 0, le — 4, le — 3). At each run, we 
measured the objective value function (the Frobenius norm of the off-diagonal 
elements), and plotted the resulting histogram (Figure 1). 

Overall, we observe that the objective values for a Gaussian distribution for 
the three settings e = 0, le — 4, le — 3. All data points are concentrated in an 
interval that depend either on the convergence threshold (le — 12) in the e = 0 
case or on the noise level (when e = le — 3, le — 4). Most interestingly, when 
e = 0, the mean of the observations is around le — 10 (which corresponds to 
the precision of le — 12 multiplied by the ~ 200 entries in the matrices), and 
e > 0, the mean depends on the noise level. Quite strikingly, multiplying the 
noise by ten produces essentially the same histogram within the same bounds, 
except that it has been translated from between 0.035 and 0.045 to between 0.35 
and 0.45. 

This observation strongly suggests that the Jacobi algorithm attains a glob¬ 
ally optimal solutions even at high noise levels. 












5.2 Low-rank matrices 


5.3 Asymmetric matrices 

6 Extension of perturbation analyses 

We conclude our discussion with some perturbation results for noisy matrices of 
the form 


Ml = UAiU^ + eRi ( 6 ) 

for some e > 0 and some matrix Ri having unit norm. 

A perturbation analyses has been carried out for matrices of this form in 
both orthogonal and non-orthogonal settings. Our first two lemmas state that 
these analyses carry over to the low-rank case as well. The fact that they also 
carry over to the asymmetric case follows trivially from our reduction. 

Lemma 1 ([4]). Let Mi = UAiU~^ + eRi, I G [L], be matrices with common 
factors U G and diagonal Ai G Let U G be a full-rank extension 

of U with columns ui,U 2 ,... ,Ud and let U G be the orthogonal minimizer 

of the joint diagonalization objective F{-). Then, for all uj, j G \k\, there exists 
a column Uj of U such that 


where E is 


d 


Uih «£ e 




T.^+o{e). 


- Xji)ujRiUi 


(7) 


( 8 ) 


when i ¥= j and i ^ k or j ^ k. We define Eij = 0 when i = j and Xu = 0 when 
i > k. 


Proof. See Proposition 1 in [4]. Note that in the low rank setting, the entries 
of Eij (Equation 15 in [4]) where i,j > k are not defined, however, these terms 
only effect the last d — k columns of JJ . The bounds for vectors ui,...,Uk only 
depend on Eij where i G [d] and j G [A:], and these are derived in the low-rank 
setting in the same way as they are derived in the full-rank proof of [4]. ■ 


We now present the corresponding perturbation bounds in [2] to the low rank 
setting. 

Lemma 2 ([2]). Let Mi = UAiU^ + eRi, I G [L], be matrices with common 
factors U G and diagonal Ai G R^^^. Let U G he a full-rank extension 

of U with columns ui, U 2 ,... ,Ud and let V = , with rows vi, V 2 ,... ,Vd. Let 

V G R‘^^^ be the minimizer of the joint diagonalization objective E{-) and let 

u = y-i. 





Then, for all Uj, j £ [fc], there exists a column uj of U such that 


^jl|2 


< e 


\ 


Yj 


( 9 ) 


where the entries of E e satisfy the equation 


Eij 

-1 

Vij Pij 

Tij 

Eji 

1 

1 

-Pij Vfj\ 

Tji 


when i ¥= j and either i ^ k or j ^ k. When i = j, Eij = 0. The matrix T has 
zero on-diagonal elements, and is defined as 

Tij ='Y'^i 

i 

and the other parameters are 

7., = IIAdbllA.b, = 1^, 

IIAilb 

We define Xu = 0 when i > k. 

Proof. In Theorem 3 in [2] it is shown that V = {I + eE)V + o(e), where Eij is 
defined for i,je [d] (Equation 36 in [2]). Then, 

U = U{I + eE)-^ + o{€) 

= U{I - eE) + o{e). 


fori ^ j i ^ d 


Pij — 


A A, 


I|a,1|2||a.||2’ 


(Ai)fc — Xik- 


Note that, once again, in the low rank setting, the entries of Eij when i,j>k 
are not characterized by Afsari’s results; however, these terms only effect the 
last d — k columns of t/. ■ 


Lemma 3. Let Mi = UAiU~^ + eRi, I £ [L], be matrices with common factors 
U £ and diagonal Ai £ . Let U £ he a full-rank extension of 

U with columns Ui,U 2 , ■ ■ ■ ,Ud and let V = U~^, with rows Vi,V 2 , ■ ■ ■ ,Vd. Let 
V £ be the minimizer of the joint diagonalization objective F{-) and let 

U = V-K 

Then, for all Uj, j £ [fc], there exists a column Uj of Lf such that 


IIM, - M,||2 < Ca 


Y 




( 10 ) 


where the entries of E e are bounded by 


\Eij I ^ 


1-pI Vl|A.||i llA.Ili 


YvjRiVjX^ 




1 = 1 




^ vjRiVjXii 


1=1 






















when i ¥= j and Eij = 0 when i = j and Xu = 0 when i > k. Here \i = 

T A 

(Ail) Aii) G and pij = | | _^ | |*^ | |^ | |^ is the modulus of uniqueness, a mea¬ 

sure of how ill-conditioned the problem is. 

Proof. From Lemma 2, we have that 


Ei- 

e' 


'jt 


< 


Vij Vji 
7ti(l - P%) 


31 


where 


li] - ||Ai||2||Aj||2, 


Pi] = 


II Adi 


llAjIb’ ||Aj||2||Ad|2 ’ 

and the matrix T is defined to be zero on the diagonal and for i ^ j defined as 

L 


AdA^- 


Tij — ^ vjRiVjXji, 


fori ^ j ^ i ^ d 


1=1 


Taking || • || to be the Zi-norm in the above expression, we have that 

if;,,! ss if;.,I + |F;,d < (|t.,| + |T,d). 


Since 


and 


71 * (1 - Pij) 

Vij + Pji _ ||A.||f + ||A,-j|2 _ 1 


7j* 


l|Adlil|A,||i 

L 


+ 


l|A*||i ||A,lli 

Tij = ^ RiVjXji, 


1=1 


the claim follows. 


These results hold for the joint matrix diagonalizer, i.e. the global optimum 
of the objective E. It is not clear whether this optimum can be attained by 
existing algorithms such as Jacobi, which limits the theoretical applicability of 
the above lemmas. Although empirical results strongly indicate that the global 
minimizer is indeed always found in the orthogonal case, we also complement 
these results with the lemma below. 

Lemma 4. Suppose the Jacobi simultaneous diagonalization algorthim is ini¬ 
tialized with U, the set eigenvectors obtained from diagonalizing a single matrix. 
Then for e > 0 small enough, Jacobi will converge to a point at which Lemmas 
1 and 2 hold. 

Proof. See appendix. ■ 

The above lemmas can be used to derive theoretical guarantees for algorithms 
that use them as a subroutine. 
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A Extension of perturbation analyses 

In this section, we analyze the convergence of the Jacobi method. We argue that 
for small enough e > 0, Jacobi will converge to a point at which our perturbation 
lemma holds. The results in this section complement our empirical assessment 
of the convergence of the method. 

A.l Notation 

Let F{V,Ai) : 0{n) 1R+ be the orthogonal joint diagonalization criterion for 

a set of matrices A4 = {Mi, • • • , Ml}, 

L 

F(V,M) = 2 \\oS{V^MiV)\\j, 

i=l 

L 

= Y, \\V^MiV\\l - WdisigiV^MiV)\\l. 

i=l 

Let AJe = {U^DiU + be the set of matrices given as input. Note that 

when e = 0 , the set of matrices commute and are jointly diagonalizable, with 
diagonalizer U. For simplicity, we will let the set of matrices U~^DiU be fixed 
and use the shorthand F{V,e) to denote F{V,M.e)- 
Several observations should be made about F{V, e): 

— F{V, e) is continuous in V, e. Since 0{n) is compact, for any fixed e, F{V, e) 
is uniformly continuous. 

— If we restrict e to lie on a closed interval I, then F{V, e) : 0{n) x / ^ 1R+ is 
uniformly continuous with both variables and with the metric |I • ||^ on (V, e) 
defined as IKV, e)||' = ||F|| + |e|. 

— Each global minimizer U of F(y,0) is isolated. We let 7 > 0 denote the 
radius of the ball 

r:={V:\\V-U\\<^} 

around U on which F(y, 0) > F{U, 0) for all V e F. 

Finally we introduce a few additional pieces of notation. We let V{e) min- 
imizers of F(y,e) on the set F; such minimizers always exist by compactness 
of F{V,e). We also define 17(e) to be a global minimizer of F{V,e)-, note that 
17(0) = U. Let d{V) denote the Jacobi step taken from V: if E+ = Jacobi(E), 
then d{V) = (E+ — V). Finally, let IF(e) to be the joint diagonalizer obtained 
from diagonalizing a single matrix in Ale. 

A.2 Outline 

Our goal is to show that if Jacobi is initialized with W{e), then all sufficiently 
small e, it will will converge to a point at which Cardoso’s lemma holds. This 
involves several steps: 

1. Showing that Jacobi converges to a point close to an unperturbed local 
minimum 17. 

2. For all such points that are close to 17, Cardoso’s lemma holds. 



A.3 Convergence of Jacobi to a neighborhood of U 


Our first lemma states that the steps taken by Jacobi get arbitrarily small when 
the objective value is close to optimal. 

Lemma 5 (Vanishing steps). Let U{e) be a global minimizer of F{V,e). For 
all J > 0, there is at > 0 such for all e > 0, whenever \F{U (e), e) — F{V, e)| <t, 
\\d{V)\\<5. 

Proof. Follows from some algebra (see Maleko, 2003). ■ 

Next, we want to establish local convergence results for Jacobi: when it is 
started close to a local minimizer, it converges to that minimizer. 

First, we will need to show that the step sizes Jacobi takes are small around 
certain local minimizers of F{V, e). In particular, when a local minimizer V{e) is 
close to U, we can use the fact that the steps around U are small to show that 
steps around V are small as well. This is what the following lemma establishes. 

Lemma 6 (Vanishing steps around local minima). For all j > S > 0, 

there exists an cq > 0 and an r > 0 such that for all 0 e ^ cq, if V{e) is a 
minimizer ofF{V,e) on the set F = {V : \V — U\ ^ 7 } satisfying ||[/ —V(e)|| < r, 
then we have 

\\V-V{e)\\ + \\diV)\\<6 

whenever ||V — V(e)|| < r. 

Proof. By Lemma 5, there exists a t > 0 such that for any e ^ 0, \F{U{e), e) — 
F{V,e)\ < t implies that ||d(V)|| < S/2. 

Choose r > 0 such that r < 5/2 and such that for all \\V — U\\ < 2r, we have 
|F"(17(0),0) — F{y,0)\ < t/A. Choose eo > 0 such that for all V e 0(n) and all 
0 ^ e ^ eo, we have \\F{V, 0) — F{V, e)|| < t/A. 

Then, when ||V —V(e)|| < r, we have ||V —17|| ^ ||V —V(e)|| + |lV(e) —C/|| ^ 
2r, and thus |F(t7(0), 0) - J^(V, 0)| < t/A. 

Next, for 0 ^ e ^ eg, and for the same V as above we have 

F{V, e) - F{U{e), e) ^ \F{V, e) - F{V, 0)| + \F{U{e),e) - F{U{e), 0)| 

+ |F(C/(0), 0) - F(C/(e), 0)1 + \F{V, 0) - i^(t/(0), 0)| 

< t/Apt/APt/APt/A = t. 

By definition of t, ||d(C)]| < 5/2 and 

||V - V(e)|| + ||d(V)|| < 5/2 + 5/2 ^ J. 


Next, we want to show for minima V(e) close to U, there is an open set 
around each V{e) that “captures” the Jacobi iterates such that they never leave 
that set. 


Here is the intuition behind the proof. Suppose V (e) is a local minimum. 
Consider a connected level set L around ^(e). Because our algorithm always 
decreases the objective function, if we start in L then we shouldn’t leave it. The 
only way we can leave L is if we take a step size large enough to move us far 
from the local minimum V(e) to an region where the objective function value is 
smaller than in L. However, if we define a set S that is contained in L and in 
which the step sizes are always sufficiently small, then it is possible to show that 
the iterates will never leave S. 

Lemma 7 (Convergence ball). For all -j > S > 0, there exist eo,r,s > 0 
such that for all 0 e ^ cq, and for all local minima V{e) of F(V^ e) on the set 
F = {V •. \V — U\ < ^] satisfying \\U — l^(e)|| < r, if we start the algorithm at a 
point Vk in the set 

DiV{e),e, S) = {||H - V{e)\\ < d; F{V, e) < F{V{e),e) + s}. 
then we will also have Vk+i 6 D(V{e),e,S). 

Proof. By Lemma 6 , there exists an eo > 0 and an r > 0 such that for all 
0 ^ e ^ Co and for ||1^ — l^(e)|| < r, where V{e) is a local minimizer of F{V,e) 
satisfying ||17 — ld(e)|| < r, we have 

\\V-V{e)\\ + \\d{y)\\<5. 

Let (j>{t,e) = inft<||y_y(e)||; ygr i^(C, e) — F{V{e),€) ^ 0 and observe that 
4i{t, e) is monotonically increasing in t. Let s = infosjesjeo (/)(r, e) and consider the 
set 

DiVie),e, S) = {||y - H(e)I| < d; FiV, e) < FiVie),e) + s}. 

Suppose that 14 £ D{V{e),€,S). Since 

f^iWVk - l^(e)ll) ^ F{Vk,e) - F(y{e),e) < s ^ e) 

and 4> is monotonically increasing, we have ||14 — l^(e)|| < r, and thus || 14 +i — 
y(e)|| < 6. Also, F(Vfc+i,e) ^ F{Vk,e), so 

F( 14 +i,e)-F(t/(e),e) <s. 


Thus Vk+ieD{V{e),e,5). ■ 

The next lemma says that for small enough e, most of these “capture” sets 
also contain the unperturbed minimizer U. 

Lemma 8 (Convergence balls in the vicinity of global optima). For all 

7 ^ (5 > 0, there exists an open set C around U, as well as eo,t > 0 such that 
for aZl 0 ^ e ^ eo, for all 0 ^ d Sq, and for all local minimizers V{e) such that 
\\U-Vie)\\<t,C^DiVie),e,S). 


Proof. Let <5 > 0 be fixed. Recall from Lemma 7, that the sets D{V (e), e, S) have 
the form 


D{V{e),e, S) = {||R - R(e)l| < <5; F{V, e) < F{V{e),e) + s}, 

where s is a constant independent of e. These sets are defined for all local minima 
V{e) such that ||R(e) —17|| < r, for some r > 0. Let ei be the upper bound on e 
that is given by the lemma. 

Let t' be such that by the uniform continuity of /, we have \F{U, e) — 
F{V{e),e)\ < s/4 for ||17 —R(e)|| < t' and for all 0 ^ e ^ ei. Let €2 be such that 
for 0 ^ e,e' ^ £ 2 , ||F(R, e) — F{y,e')\\ < s/4 for all V such that ||t4 — 17|| < S. 
Finally let Cq = min(ei,e 2 ) and let t = min(t',r,S/2). 

Define the set C as 

C= {||R-17|| < S/2; F(V,0) < F(U,0) + s/4}. 

For V e C and for 0 ^ e ^ cq, we have 

\FiV, e) - FiVie), e)| < \F{V, 0) - F^V, e)| + |F(f/, e) - F(R(e), e)| 

+ \F{U, 0) - F(C/, e)| + |F(R, 0) - F{U, 0)| 

< s/4 + s/4 + s/4 + s/4 = s. 


Moreover, for V e C, 

||y _ y(e)|| ^ ||y _ c/|| + \\u - F(e)|| < S/2 + S/2 = 5. 

This establishes that C c D{V{e),e,S). ■ 

Thus the sets D{V (e), e, S) have non-empty interior and the size of that inte¬ 
rior is bounded from below. Next, we would like to show that for small enough 
perturbations, there exist local minima V (e) of F{V, e) that are arbitrarily close 
to U. 

Lemma 9 (Existence of close local minima). For allj > S > 0, there exists 
an €0 > 0 such that for 0 ^ e ^ cq, there is a V{e) such that ||R(e) — {7|| < S 
and V(e) is a local minimizer of FiV.e) in the sense that F(V(e),e) ^ F(V,e) 
on the set{V: \\V-V{e)\\ < 7 }. 

Proof. By compactness, for every e, there exists a minimum V (e) of F(/V, e) on 
the set {R : ||?7 — E|| ^ 7 }. Suppose that the claim of the lemma does not 
hold; then there is a 7 > > 0 and a sequence of R(e„),en with e„ ^ 0 and 

S' ^ ||R(en) — C^ll ^7 such that 

FiVien),en)-F{U,en) ^0 

for all n. Taking limits, we find that F(V, 0) ^ F{U, 0), where Y = lim„^oo V (cn) 
and S' ^ ||y — ?7|| ^ 7 , contradicting the fact that U was the only minimum on 

F. m 

The next lemma deals with the initializer W(e). 


Lemma 10 (Diagonalization of a single matrix). Suppose the eigenvalues 
of the Mo are distinct. Let e > 0, and let W (e) be the estimate of the eigenvalues 
obtained by diagonalizing one of the Me- As e 0, W{e) —>■ U. 

Proof. Follows from the fact that eigenvectors and eigenvalues of M + eD are 
continuous in e for small enough e. I 

We use all of the above results to establish the following important lemma 
says that we can converge arbitrarily close to a global optimum for small amounts 
of noise. 

Lemma 11 (Convergence to vicinity of global minimum). There exist 
(5o > 0 and eg > 0 such that for all 0 ^ S < Sq, there exists an 0 ^ e < cq 
such that when the Jacobi algorithm is initialized at a point W (e) obtained from 
diagonalizing one matrix will almost surely converge to a local minimum V of 
F{-,Me) with \\V-U\\ < 26. 

Proof. By Lemma 8 , for all 7 > 5 ^ 0, there exists an open set C around U, 
as well as ei,t > 0 such that for all 0 sg e ^ ei, for all 0 ^ ^ and for 

all local minimizers ^(e) such that \\U — F(e)|| < t, C D{V{e),€,S), where 
D{V{€),e, ^) is a set defined in Lemma 7. 

Let one such <5 > 0 be fixed. By Lemma 9, there exists an £2 > 0 such that for 
0 < e ^ £ 2 , there exists a local minimizer V(e) such that | |y(£) — C/|| < min(t, 6). 

Now let /3 > 0 be the radius of a ball around [/ that is contained in C. By 
Lemma 10, there exists an £3 > 0 such that for 0 < £ ^ £ 3 , ||1F(£) — U\\<(3. 
Note that the lemma holds because we obtain eigenvalues by random projection, 
and hence they will be distinct almost surely. 

Since | |bF(£) — 17| | < /3, W^e) G C. Suppose that we start the algorithm on the 
function F{V,e) at W{e). Then since S c D{y{e),e,5), all subsequent iterates 
will be in D{V{e),€,S) as well. By compactness, the iterates will converge to a 
point Y satisfying ||F(£) — TH < <5. Furthermore, since ||17 — l^(e)|| < min(t, 5), 
we have 

\\U-Y\\ < 26. 

Thus the lemma holds with (5o := 7 and £0 = min(£i, £ 2 , £ 3 ). 


A.4 Cardoso’s lemma holds in a neighborhood of U 

Finally, we will make the argument that Cardoso’s lemma holds for the points 
to which the Jacobi method will converge. 

First, recall that any orthogonal matrix V can be written as U exp(£’) for 
some skew-symmetric matrix E. The fact that V = {I + E)U + o{E) follows from 
this observation by using the series form of the matrix exponential exp(iil). 

Cardoso’s result holds for any V = {I + eE')U + o{eE'), where E = 0(1). 
Thus, it is enough to show that the E defined above satisfies E = 0{t). To prove 
this, we will use the following lemma by Cardoso: 


Lemma 12 (Cardoso). If V is critical point of F{-,A4), then S{V,A4) = 
where 

SiV, M) := 2 2 (ef M^Ve,){eie^V - Ve,eJ) 

k i^j 

Lemma 13. Let V = (I + E)U + o{E) be the expanded expression of a local 
minimizer V{e) of E{-,A4e)- Then ||i?|| = 0(e) as e —* 0. 

Proof. Before we proceed, observe that the faction that V converges to a global 
minimizer U by our previous lemma and thus i? ^ 0 as e 0. All we need to 
establish is the rate. For this, we will use Lemma 12. Recall that R is a critical 
point of F whenever S{y,M.) = S{V,A4)'^, where 

SiV,M) := ^ Y.{eJV^MkVe,){eieJV^M^V - V^M^VeicJ). 

k ijij 

It is shown by Cardoso (1994) that 

{S{V,M,)-SiV,M,f)^J=2eY,id^{k)-dJ{k))eJU^RkUeJ+2E,,Y,id^ik)-dJik)f+o{^)+oiE^J). 

k k 


Thus, we can define 


hj{E,j,e) = {S{V,Mf) - S{V,Mefh- 

Let e„ be any sequence such that R(e„) 1/ as e„ —> 0 and let e\'^'^ be the 

(i,j)-ih. element of the matrix En defined in = (J + En) + o{En). Combining 
this with Cardoso’s expansion and using the fact that C(e(„)) is a minimizer, 
we have that 

0 = AijE^^j ^ + BijCn + o(Eij) + o(e), 

for all n, where Aij and Bij are the constants from Cardoso’s expansion. This 
immediately establishes that Eij = 0{e) (with constant B^fAif). The fact that 
nil'll = 0(e) follows by combining the result for each (i,j). 


The above lemma establishes that Lemma 1 to holds for points attained by 
the Jacobi algorithm. 


